@earlephilhower
Owner

Thanks to Yohine. He identified via email a leak of DHCP state that would cause LWIP to panic() after 256 disconnects.

Properly clean up DHCP state on link ::end (shutdown).

@earlephilhower earlephilhower merged commit b832dee into master Oct 21, 2025
32 checks passed
@earlephilhower earlephilhower deleted the pop1 branch October 21, 2025 16:27
@yohine

yohine commented Oct 22, 2025

Thank you for the quick correction and feedback.

@earlephilhower
Owner Author

@yohine with this plus #3213 I got 3301 WiFi.begin()/WiFi.end() cycles overnight with no leaks. Unfortunately, at the 3302nd loop the CYW43 chip started timing out and stopped responding to messages from the CYW43 driver running on the Pico. Debugging w/GDB, I can see the driver try to send packets to the CYW43, which doesn't respond within the timeout.

So AFAICT the binary blob running on the 2nd ARM chip has hung/died/something at this point. Nothing we can do about that here since it's completely opaque.

/*
    This sketch establishes a TCP connection to a "quote of the day" service.
    It sends a "hello" message, and then prints received data.
*/

#include <WiFi.h>

#ifndef STASSID
#define STASSID "your-ssid"
#define STAPSK "your-password"
#endif

const char* ssid = STASSID;
const char* password = STAPSK;

const char* host = "djxmmx.net";
const uint16_t port = 17;

//WiFiMulti multi;

void setup() {
  Serial.begin(115200);

  // We start by connecting to a WiFi network

  Serial.println();
  Serial.println();
  Serial.print("Connecting to ");
  Serial.println(ssid);

//  multi.addAP(ssid, password);

//  if (multi.run() != WL_CONNECTED) {
    //Serial.println("Unable to connect to network, rebooting in 10 seconds...");
    //delay(10000);
    //rp2040.reboot();
  //}
  delay(5000);
  Serial.println("");
  Serial.println("WiFi connected");
  Serial.println("IP address: ");
  //Serial.println(WiFi.localIP());
}

static int l = 0;  // loop iteration counter
static int f = 0;  // failed-connection counter
void loop() {
  static bool wait = false;
  Serial.printf("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ %d %d\n", l++, rp2040.getFreeHeap());
  //stats_display();
  WiFi.begin(ssid, password);
  if (!WiFi.connected()) {
      Serial.printf("--------------------------------------------------------------------------------- fail %d\n", f++);
      delay(1000);
    return;
  }

  Serial.print("connecting to ");
  Serial.print(host);
  Serial.print(':');
  Serial.println(port);

  // Use WiFiClient class to create TCP connections
  WiFiClient client;
  if (!client.connect(host, port)) {
    Serial.println("connection failed");
    delay(5000);
    return;
  }

  // This will send a string to the server
  Serial.println("sending data to server");
  if (client.connected()) {
    client.println("hello from RP2040");
  }

  // wait for data to be available
  unsigned long timeout = millis();
  while (client.available() == 0) {
    if (millis() - timeout > 5000) {
      Serial.println(">>> Client Timeout !");
      client.stop();
      WiFi.end();
      //delay(60000);
      return;
    }
  }

  // Read all the lines of the reply from server and print them to Serial
  Serial.println("receiving from remote server");
  // not testing 'client.connected()' since we do not need to send data here
  while (client.available()) {
    char ch = static_cast<char>(client.read());
    Serial.print(ch);
  }

  // Close the connection
  Serial.println();
  Serial.println("closing connection");
  client.stop();

  if (wait) {
//    delay(300000);  // execute once every 5 minutes, don't flood remote service
  }
  //Serial.printf("++++at end++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ %d %d\n", l, rp2040.getFreeHeap());
  //stats_display();
//delay(16000);
//  Serial.printf("++after 16s ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ %d %d\n", l, rp2040.getFreeHeap());
//  stats_display();


  //wait = true;
  WiFi.end();
//  Serial.printf("++++after wifiend++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ %d %d\n", l, rp2040.getFreeHeap());
//  stats_display();

//while(1);
}

@yohine

yohine commented Oct 23, 2025

  • An important additional note: My current tests use only UDP, and I'm not connecting and disconnecting on every iteration (I have only tested reconnection a few times). Please treat this information as coming from different conditions than yours. I'll run a similar reconnection test from now on.

I've been testing for a few days now. While I haven't reached a final conclusion yet, after making the same changes as in your #3213, the problem appears to have been resolved in my environment.

Deleting netif_remove() from netif_add() no longer causes the panic.
CYW43.begin() has also not stopped.
There have been no crashes in two days.

However, on careful inspection, there appear to be slight differences between our fixes. These may or may not be related, but I'll share them here for testing.

@LwipIntfDev::begin

    if (_isDHCP) {
        ip4_addr_set_u32(ip_2_ip4(&_netif.ip_addr), 0);

        netif_set_up(&_netif);
        if (netif_is_link_up(&_netif)) {
            switch (dhcp_start(&_netif)) {
            case ERR_OK:
                break;
            case ERR_IF:
                netif_remove(&_netif);
                return false;
            default:
                netif_remove(&_netif);
                return false;
            }
        }
    } else {
        netif_set_link_up(&_netif);
        netif_set_up(&_netif);
    }

void LwipIntfDev<RawDev>::end() {
    if (_started) {
        if (_intrPin < 0) {
            __removeEthernetPacketHandler(_phID);
        } else {
            detachInterrupt(_intrPin);
            __removeEthernetGPIO(_intrPin);
        }

        if (_removeNetifCB) {
            _removeNetifCB(&_netif);
        }

        RawDev::end();
	
        if (_isDHCP) {
            dhcp_stop(&_netif);
            dhcp_cleanup(&_netif);
        }

        netif_remove(&_netif);

        _started = false;
    }
}

@cyw43_spi_transfer() in cyw43_bus_pio_spi.c

        uint32_t fdebug_tx_stall = 1u << (PIO_FDEBUG_TXSTALL_LSB + bus_data->pio_sm);
        bus_data->pio->fdebug = fdebug_tx_stall;

        //! timeout !
        uint32_t start_time = time_us_32();
        uint32_t timeout_us = 1000;
        while (!(bus_data->pio->fdebug & fdebug_tx_stall)) {
            if (time_us_32() - start_time > timeout_us) {
                pio_error = 1;
                break; 
            }
            tight_loop_contents();
        }

        __compiler_memory_barrier();
        pio_sm_set_enabled(bus_data->pio, bus_data->pio_sm, false);
        pio_sm_set_consecutive_pindirs(bus_data->pio, bus_data->pio_sm, CYW43_PIN_WL_DATA_IN, 1, false);
    } else if (rx != NULL) { /* currently do one at a time */
        DUMP_SPI_TRANSACTIONS(
                printf("[%lu] bus TX %u bytes:", counter++, rx_length);
                dump_bytes(rx, rx_length);
        )
        panic_unsupported();
    }
    pio_sm_exec(bus_data->pio, bus_data->pio_sm, pio_encode_mov(pio_pins, pio_null)); // for next time we turn output on

    stop_spi_comms();
    DUMP_SPI_TRANSACTIONS(
            printf("RXed:");
            dump_bytes(rx, rx_length);
            printf("\n");
    )

    //! pio error !
    if (pio_error > 0){
        return CYW43_EIO;
    }

    return 0;
}
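
The bounded wait in the snippet above relies on unsigned 32-bit subtraction, which stays correct even when the microsecond counter wraps. A minimal host-side sketch of that pattern follows. The helper name wait_with_timeout and the injected clock are hypothetical and are not part of the Pico SDK; on the device the clock would be time_us_32().

```cpp
#include <cstdint>
#include <functional>

// Hypothetical helper for illustration only (not a Pico SDK function).
// Polls `done` until it returns true or more than `timeout_us` microseconds
// have elapsed according to `now_us`.
// Returns true if the condition was met, false if the wait timed out.
inline bool wait_with_timeout(const std::function<bool()>& done,
                              const std::function<uint32_t()>& now_us,
                              uint32_t timeout_us) {
    const uint32_t start = now_us();
    while (!done()) {
        // Unsigned subtraction gives the elapsed time modulo 2^32, so the
        // comparison remains valid if the counter wraps after `start`.
        if (static_cast<uint32_t>(now_us() - start) > timeout_us) {
            return false;
        }
    }
    return true;
}
```

A caller would map a false return to an error code, as the patched cyw43_spi_transfer() above does with CYW43_EIO.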

@earlephilhower
Owner Author

Diff 1 is trivial. I just cleaned up that ugly, ugly switch. I think it originally had more to it, but you can clearly see it does a {netif_remove/return false} whenever dhcp_start != ERR_OK. So I cleaned that up as I was going because I don't want to end up on TheDailyWTF.

Diff 2, in your case, I think has a race condition. Because networking is IRQ driven, you could get an IRQ right after RawDev::end (say, a DHCP retry). At that point the global netif list still contains the old, ended interface and will try to call CYW43::sendPacket, which may not go so well. (TBH, looking at mine I see the same thing, only with a slightly smaller window.)

Diff 3: if you found a bug in the CYW43 driver, then please do post something on Pico-SDK to get the fix for everyone. I don't modify the upstream SDK for this core, at all, for sanity's sake. (Also, unless you reran make-libpico.sh your change would not be used by the core... I build the SDK and ship it as a blob we link to.)

I think the real diff may just be in testing methods. I was banging WiFi.begin() and WiFi.end() as fast as the chip would do it w/o actually turning the AP on/off...
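
To make the race in diff 2 concrete, here is a small host-side simulation. Everything in it (FakeDevice, FakeNetif, g_netifs, fakeIrq) is a stand-in invented for illustration and is not the lwIP or arduino-pico code. It only shows why an interrupt landing between "device ended" and "interface unlisted" can reach a dead device. Whether the real teardown can safely be reordered this way is an assumption that has not been checked against the core.

```cpp
#include <algorithm>
#include <vector>

// Stand-ins for illustration only; NOT the lwIP or arduino-pico types.
struct FakeDevice {
    bool running = true;
    int sendsAfterEnd = 0;  // counts transmits attempted on an ended device
    void end() { running = false; }
    void sendPacket() { if (!running) ++sendsAfterEnd; }
};

struct FakeNetif { FakeDevice* dev; };

// Models the global interface list that IRQ-driven code walks.
static std::vector<FakeNetif*> g_netifs;

// Models an IRQ-driven event (e.g. a DHCP retry) that transmits on every
// interface still present in the list.
void fakeIrq() {
    for (FakeNetif* n : g_netifs) n->dev->sendPacket();
}

void fakeNetifRemove(FakeNetif* n) {
    g_netifs.erase(std::remove(g_netifs.begin(), g_netifs.end(), n),
                   g_netifs.end());
}

// Ordering A: end the device first, unlist the interface afterwards.
// `irqBetween` injects the simulated IRQ into the gap between the two steps.
void teardownDeviceFirst(FakeNetif* n, bool irqBetween) {
    n->dev->end();
    if (irqBetween) fakeIrq();  // interface still listed: hits the dead device
    fakeNetifRemove(n);
}

// Ordering B: unlist the interface first, then end the device.
void teardownUnlistFirst(FakeNetif* n, bool irqBetween) {
    fakeNetifRemove(n);
    if (irqBetween) fakeIrq();  // interface already gone: nothing to send on
    n->dev->end();
}
```

In this toy model, ordering A records one send on an ended device when the IRQ fires in the gap, while ordering B records none.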

@yohine

yohine commented Oct 26, 2025

The reason for the different results could be the test content, or it could be related to the PIO or hardware. At this point, I don't think there's a clear answer.

In my long-term testing, the complete stoppage has not recurred even once. Instead, I've found another serious problem, which I'm currently investigating: the connection status gets stuck at 6 after a certain number of connection attempts or a certain amount of time. Status 3 indicates a successful connection, but the status remains at 6. Disconnecting the AP changes it from 6 to 4, but the same problem persists afterward.

I haven't yet determined the specific time or number of attempts, and I'm still collecting data. It's unclear whether this is related to the current problem, but if my PIO timeout is triggered, that might be the cause. The behavior of the CYW43 after disconnection is undefined. However, at this point, it's only a possibility.

Regarding point 3: If it's ultimately confirmed to be a PIO problem, I will report it on the Pi Forum. However, the current information is probably not enough to convince them. Based on my experience, they are unlikely to trust my report.

In my environment, all the necessary object files have been removed from the static library, and the corresponding sources are compiled locally. I've confirmed that my local build is the one actually used: adding an infinite loop to a local function in cyw43_bus_pio_spi does hang execution as expected. For example, I removed the object with a command like this:
arm-none-eabi-ar.exe d liblwip.a cyw43_bus_pio_spi.c.o

Unfortunately, I have other work to do, so I won't be able to test for about a week, and I'll resume the retesting described above after that. However, I plan to keep investigating this problem until I either solve it or give up.
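
For collecting data on the stuck status, one option is a small stall detector fed from the sketch's loop. The class below is a hypothetical diagnostic helper, not part of arduino-pico. It takes the numeric status values exactly as reported above (3 for connected) and assumes nothing about what the other codes mean.

```cpp
#include <cstdint>

// Hypothetical diagnostic helper for illustration; not an arduino-pico class.
// Flags a stall when the status has stayed at the same non-connected value
// for at least `stallMs` milliseconds.
class StatusStallDetector {
public:
    StatusStallDetector(int connectedStatus, uint32_t stallMs)
        : _connected(connectedStatus), _stallMs(stallMs) {}

    // Feed the latest status and a millisecond timestamp.
    // Returns true once the current non-connected status has been held
    // unchanged for `stallMs` or longer.
    bool update(int status, uint32_t nowMs) {
        if (!_seen || status != _last) {
            // First sample or a status change: restart the hold timer.
            _seen = true;
            _last = status;
            _since = nowMs;
            return false;
        }
        // Unsigned subtraction keeps the elapsed time valid across wraparound.
        return status != _connected &&
               static_cast<uint32_t>(nowMs - _since) >= _stallMs;
    }

    int last() const { return _last; }

private:
    int _connected;
    uint32_t _stallMs;
    bool _seen = false;
    int _last = 0;
    uint32_t _since = 0;
};
```

In a sketch this could be called as detector.update(WiFi.status(), millis()), logging the loop count and free heap whenever it returns true so the point at which the status sticks can be correlated with the number of attempts.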
